Meridian — Education
Schemas & Embeddings from First Principles
What they are, why they matter, and the consequences of choosing wrong.
Part 1
What Is a Schema?

A schema is a contract. It says: every piece of knowledge stored in this system will have these exact fields, in these exact formats, every single time.

Think of it like a form. If you go to a hospital, every patient record has the same fields: name, date of birth, blood type, allergies, medications. They don't let one doctor scribble "Bob, he's 40ish, allergic to something" while another fills out a fully structured record. The form IS the schema. It enforces consistency.

Why does this matter for AI?

Your sovereign AI stores thousands of pieces of knowledge. Principles, tactics, lessons learned, frameworks. If each one is stored differently — some with a confidence score, some without, some with a source name, some with a hex ID, some with mechanism explanations, some without — you can never reliably search, compare, or transfer knowledge between systems.

The schema is the interoperability protocol. If two Meridian builds use the same schema, a codex pack created by one can be installed into the other without any translation. If they use different schemas, every transfer requires a conversion step — and every conversion step is a place where data gets lost or corrupted.

A concrete example

Say you extract a principle from a marketing course:

"Lead with outcomes, not mechanisms, when selling to skeptical men."

Stored as NODE_SCHEMA:
  id:               "a7f3b2c1-..."          (unique forever)
  text:             "Lead with outcomes..."  (the actual principle)
  title:            "Outcomes before mechanisms"
  node_type:        "principle"
  source_id:        "Anatomy of Ads 2.0"    (where it came from)
  confidence_score: 0.92                    (how validated it is)
  tags:             "cold_traffic,identity,masculine"
  mechanism:        "Skeptical men evaluate outcome identity before caring about how-to"
  situation:        "Cold traffic ads for identity-based offers"
  when_not:         "Warm retargeting where credibility is established"

Every single principle in the system has these same fields. You can search by confidence. You can filter by tags. You can retrieve by situation. You can compare mechanisms. You can track where it came from. The schema makes the knowledge machine-readable, not just human-readable.

What happens without a schema

Rob's early system stored knowledge as "holons" — semi-structured blobs with varying fields. Some had sources, some didn't. Some had categories, some were just raw text. When you want to ask "show me all principles about cold traffic with confidence above 0.8" — you can't, because some holons don't have a confidence field, and some don't have category tags.

A schema solves this by requiring every field to exist on every record, even if it's empty. You always CAN query confidence, even if some nodes are at the default 0.75 because they haven't been validated yet.
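As a sketch of that rule: every record carries every field, with explicit defaults, so a query over confidence or mechanism never hits a missing key. The field names follow the NODE_SCHEMA description in this document (the vector field is omitted for brevity); the defaults are illustrative assumptions, not the production code.

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Hypothetical sketch of NODE_SCHEMA defaults (vector field omitted).
    id: str
    text: str
    title: str = ""
    node_type: str = "principle"
    source_id: str = ""
    framework_id: str = ""
    confidence_score: float = 0.75   # default until validated in the real world
    tags: str = ""
    mechanism: str = ""
    situation: str = ""
    when_not: str = ""
    collection: str = "principles"
    date_added: str = ""

nodes = [
    Node(id="a7f3b2c1", text="Lead with outcomes...", confidence_score=0.92,
         tags="cold_traffic,identity,masculine"),
    Node(id="b8e4c3d2", text="Unvalidated note"),   # gets the 0.75 default
]

# The query that fails on schema-less "holons" works on every record here:
high_conf = [n for n in nodes if n.confidence_score > 0.8]
```

Because the field exists on every record, the filter never needs a "does this node have a confidence field?" branch.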


Part 2
The Schemas That Exist Right Now

Your system (VOHU MANAH) has 4 schemas because it evolved over time. Each was created for a different purpose.

Schema A: NODE_SCHEMA — The Main One

14 fields. Used by 7 collections. This is the rich, fully structured schema for actual knowledge — principles, tactics, examples, book excerpts. Every field is intentional:

  Field              Why it exists
  id                 Unique forever. Never changes. Lets you reference a specific principle across systems.
  vector             The embedding (we'll explain this in Part 3). Lets you search by meaning, not keywords.
  text               The actual content. What gets embedded and what humans read.
  title              Short label for display. "Outcomes before mechanisms" vs the full text.
  node_type          What kind of knowledge this is. A principle (general truth), a tactic (specific action), a concept (abstract idea), an example (situated instance).
  source_id          Where it came from. "Anatomy of Ads 2.0" — traceable back to the original material.
  framework_id       Which cluster of related principles this belongs to. E.g. the "cold_traffic" framework.
  confidence_score   0.0 to 1.0. How validated this principle is. Starts at extraction quality. Rises when the principle works in the real world. Falls when it doesn't.
  tags               Searchable labels. "copywriting,cold_traffic,identity" — comma-separated.
  mechanism          HOW/WHY this works. The causal explanation. Not just "do this" but "this works because..."
  situation          WHEN to apply this. The context where this principle is valid.
  when_not           WHEN NOT to apply this. Just as important — prevents misapplication.
  collection         Which collection it lives in (principles, tactics, etc.).
  date_added         When it was ingested. For tracking recency.

Schema B: LEGACY_SCHEMA — The Simple One

7 fields. Used by 4 collections. Created earlier, for simpler document storage (positioning docs, reference sites). Doesn't have mechanism, situation, when_not, framework_id, or even an id field. It's basically: text + source + confidence + tags.

Problem: Legacy collections can't participate in codex exchange because they're missing the fields that make knowledge useful (mechanism, when_not, situation). A codex buyer can't use a principle that says "do this" without knowing when to do it and when NOT to.

Schema C: CONV_SCHEMA — Conversations

9 fields. Stores full chat conversations as JSON blobs. Completely different purpose — this is session history, not knowledge. Not part of codex exchange.

Schema D: EVERGREEN_SCHEMA — Synthesis Pages

16 fields. Stores synthesized long-form content (trunk, branches, leaves, threads). This is the OUTPUT of the synthesis pipeline, not atomic knowledge. Not part of codex exchange as-is.

Rob's System (GHOSTNET)

Rob uses a different storage format entirely. His 16,717 holons are stored in LanceDB but with a different schema — less structured than NODE_SCHEMA, more like LEGACY. His holons have: text, vector (768-dim — different embedding model), and varying metadata. No standardized mechanism/situation/when_not fields.

This is why SPEC-001 matters: For Meridian to work — for codex packs to transfer between builds, for the collective to synthesize across nodes — everyone must use the same schema. NODE_SCHEMA is the candidate. It's the richest, most validated (6,797 nodes in production), and already enforced with hard validation (wrong function → ValueError).

Part 3
What Is an Embedding?

This is the most important concept to understand. Everything else flows from it.

The problem: computers can't understand meaning

A computer sees "the dog sat on the mat" and "the canine rested on the rug" as completely different strings. Different characters, different lengths. To a computer doing string comparison, these have zero similarity.

But to a human, they mean the same thing.

An embedding is a way to convert meaning into numbers. Specifically, into a list of numbers (a "vector") where similar meanings produce similar numbers.

How it works (simplified)

An embedding model is a neural network that has been trained on billions of text examples. It learned that "dog" and "canine" appear in similar contexts, so they should map to similar numbers. It learned that "the dog sat on the mat" and "investment banking regulations" appear in completely different contexts, so they should map to very different numbers.

When you feed text into an embedding model, it outputs a list of numbers. Like this:

"Lead with outcomes, not mechanisms"  →  [0.23, -0.15, 0.87, 0.02, -0.41, ... ] (1024 numbers)
"Show results before explaining how"  →  [0.21, -0.14, 0.85, 0.03, -0.39, ... ] (1024 numbers)
"How to change a car tire"            →  [-0.67, 0.33, -0.12, 0.55, 0.08, ... ] (1024 numbers)

The first two are about the same concept (outcome-first marketing). Their numbers are almost identical. The third is about something completely different. Its numbers are completely different.
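You can check this with the truncated five-number prefixes above. Cosine similarity, the standard way to compare embedding vectors, already separates them; this is a toy calculation on five numbers, not the real 1024-dimension math:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

outcomes = [0.23, -0.15, 0.87, 0.02, -0.41]   # "Lead with outcomes..."
results  = [0.21, -0.14, 0.85, 0.03, -0.39]   # "Show results..."
car_tire = [-0.67, 0.33, -0.12, 0.55, 0.08]   # "How to change a car tire"

print(cosine(outcomes, results))   # near 1.0: almost identical meaning
print(cosine(outcomes, car_tire))  # negative: unrelated meaning
```

A score near 1.0 means "same meaning", near 0 means "unrelated", and negative means "pointing in opposite semantic directions".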

Why 1024 numbers?

This is the dimension of the embedding. More dimensions = more nuance. Think of it like describing a color: three numbers (red, green, blue) can name any hue, but more numbers could also capture brightness, saturation, and sheen. More numbers mean finer distinctions between things that are almost alike.

768 dimensions (Rob's model) vs 1024 dimensions (your model) means your model captures slightly more nuance. Whether that matters depends on the data.

How search works with embeddings

When you ask "how do I sell to skeptical men?", the system:

  1. Embeds your question into 1024 numbers
  2. Compares those numbers to every stored principle's 1024 numbers
  3. Returns the principles whose numbers are most similar to your question's numbers

This is semantic search — search by meaning, not keywords. You don't need to use the exact words that are in the stored principle. "How do I sell to skeptical men?" finds "Lead with outcomes, not mechanisms" because the embeddings capture the semantic relationship.
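The three steps amount to a nearest-neighbour scan over the stored vectors. A minimal sketch with the embedding model stubbed out; the vectors and the embed() stub here are made up for illustration, not output from any real model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stub standing in for the real embedding model (hypothetical 3-dim vectors).
def embed(text):
    fake = {
        "how do I sell to skeptical men?":    [0.22, -0.14, 0.86],
        "Lead with outcomes, not mechanisms": [0.23, -0.15, 0.87],
        "How to change a car tire":           [-0.67, 0.33, -0.12],
    }
    return fake[text]

stored = ["Lead with outcomes, not mechanisms", "How to change a car tire"]

def search(query, k=1):
    q = embed(query)                                        # 1. embed the question
    scored = [(cosine(q, embed(t)), t) for t in stored]     # 2. compare to every node
    return [t for _, t in sorted(scored, reverse=True)[:k]] # 3. return most similar

print(search("how do I sell to skeptical men?"))
```

No keyword overlaps between the query and the winning principle; the ranking comes entirely from vector similarity. A real system would precompute and index the stored vectors rather than re-embedding them per query.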

This is why the embedding model choice is critical. The quality of search, retrieval, codex integration, and collective synthesis all depend on the embedding model understanding the nuances of your domain. A bad embedding model will return irrelevant results. A good one will surface exactly what you need.

Part 4
Why Q and Rob's Systems Can't Merge Right Now

Q's system uses BGE-M3 — produces 1024 numbers per text.
Rob's system uses nomic-embed-text — produces 768 numbers per text.

These are not compatible. You cannot compare a list of 1024 numbers to a list of 768 numbers. It's like trying to compare a 3D object to a 2D photograph of it — they represent the same thing but in different dimensional spaces. The math doesn't work.
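The incompatibility is mechanical. Comparing two vectors pairs them element by element, so any honest implementation has to reject mismatched lengths; a sketch:

```python
def cosine(a, b):
    # Element-wise math needs equal dimensions; 768 vs 1024 has no valid pairing.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

bge_vector   = [0.1] * 1024   # BGE-M3 output shape
nomic_vector = [0.1] * 768    # nomic-embed-text output shape

try:
    cosine(bge_vector, nomic_vector)
except ValueError as e:
    print(e)   # dimension mismatch: 1024 vs 768
```

(Python's zip() would silently truncate to the shorter vector, which is worse than an error: it returns a number that means nothing.)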

This means no shared search, no codex transfer, and no collective synthesis between the two systems until one of them re-embeds everything.

This is the single most important infrastructure decision Meridian must make. Every build, every codex, every collective emission must use the same embedding model. Once you choose and clients start building, switching is astronomically expensive — you have to re-embed every single node across every single build.

Part 5
The Embedding Models Available

There are hundreds of embedding models. For Meridian, only a handful are realistic, because a candidate must run locally on CPU (sovereignty), be open source (no API dependency), deliver high search quality (retrieval must be accurate), and be proven at scale.

  BAAI/bge-m3
    Dimensions: 1024 | Size: 1.3 GB | Who: Beijing Academy of AI
    Quality (MTEB): Very high (top 5 on MTEB retrieval) | Speed (CPU): ~0.5s per text
    Notes: Q's current model. Multilingual. Supports dense + sparse + multi-vector. The most versatile option.

  nomic-embed-text
    Dimensions: 768 | Size: 274 MB | Who: Nomic AI
    Quality (MTEB): Good (comparable to OpenAI ada-002) | Speed (CPU): ~0.2s per text
    Notes: Rob's current model. Smaller, faster. Open source. Less nuanced than BGE-M3.

  BAAI/bge-large-en-v1.5
    Dimensions: 1024 | Size: 1.2 GB | Who: Beijing Academy of AI
    Quality (MTEB): High | Speed (CPU): ~0.4s per text
    Notes: English-only predecessor to BGE-M3. Slightly worse quality. Same dimensions.

  sentence-transformers/all-MiniLM-L6-v2
    Dimensions: 384 | Size: 80 MB | Who: Sentence Transformers
    Quality (MTEB): Medium | Speed (CPU): ~0.05s per text
    Notes: Very fast, very small, but 384-dim means less nuance. Fine for simple search, not enough for Meridian's knowledge density.

  Cohere embed-v3
    Dimensions: 1024 | Size: API only | Who: Cohere
    Quality (MTEB): Very high | Speed: Fast (API)
    Notes: Top quality but requires API — breaks sovereignty. Not viable for air-gapped clients.

  OpenAI text-embedding-3-large
    Dimensions: 3072 | Size: API only | Who: OpenAI
    Quality (MTEB): Highest | Speed: Fast (API)
    Notes: Best quality available but API-only + closed source. Non-starter for sovereignty. Also 3072-dim = 3x storage cost.

  Snowflake/arctic-embed-l
    Dimensions: 1024 | Size: 1.1 GB | Who: Snowflake
    Quality (MTEB): High | Speed (CPU): ~0.4s per text
    Notes: Strong retrieval performance. Open source. 1024-dim. Worth benchmarking against BGE-M3.

  Alibaba/gte-Qwen2-7B-instruct
    Dimensions: 3584 | Size: 14 GB | Who: Alibaba
    Quality (MTEB): Near-best | Speed (CPU): Very slow
    Notes: 7B parameter model — runs as a full LLM. Highest quality local option but requires GPU and massive resources. Not practical for client builds.

Part 6
The Tradeoffs — Pros, Cons, Consequences

BGE-M3 (Q's choice) — 1024-dim, 1.3 GB

Pros:
  • Top-tier retrieval quality on MTEB benchmarks
  • Multilingual — works in 100+ languages (Will speaks 5)
  • Supports dense, sparse, AND multi-vector search
  • 1024-dim captures fine-grained semantic nuance
  • Already validated with 6,797 nodes in production
  • Open source (MIT license), runs on CPU
  • Active development by BAAI

Cons:
  • 1.3 GB model — takes ~30s to load on first use
  • ~0.5s per embedding on CPU (fine for query, slow for bulk ingestion)
  • 1024-dim = more storage per node (4 KB per vector vs 3 KB for 768-dim)
  • Not the fastest option available

nomic-embed-text (Rob's choice) — 768-dim, 274 MB

Pros:
  • 5x smaller (274 MB vs 1.3 GB) — loads faster, less RAM
  • 2.5x faster per embedding (~0.2s vs ~0.5s)
  • Good quality (comparable to OpenAI ada-002)
  • Open source
  • Native Ollama support (Rob's stack)
  • 768-dim = less storage per node

Cons:
  • Lower semantic resolution (768 vs 1024 dimensions)
  • English-only — degraded quality for non-English text
  • Worse on MTEB retrieval benchmarks than BGE-M3
  • 768-dim is less standard — most modern models are moving to 1024+
  • No sparse or multi-vector support

Consequences of choosing wrong

If you pick a model and later need to switch:

Every single node — across every single build, every codex pack, every collective emission — must be re-embedded. For Q's current system, that's 6,797 nodes × 0.5s = ~1 hour. For a mature collective with 33 nodes at 10K principles each? 330,000 nodes × 0.5s = 46 hours of CPU time. Per build.
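The arithmetic above, spelled out (0.5s per embedding is the CPU estimate for BGE-M3 from Part 5):

```python
def reembed_hours(node_count, seconds_per_node=0.5):
    # Total CPU time to regenerate every vector, in hours.
    return node_count * seconds_per_node / 3600

print(round(reembed_hours(6_797), 1))     # Q's system today: about an hour
print(round(reembed_hours(330_000), 1))   # 33 nodes x 10K principles each
```

The cost scales linearly with node count, so every month of delay on the decision makes an eventual switch strictly more expensive.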

You cannot do a partial migration. Mixed embeddings are incompatible. It's all or nothing.

This is why you choose once and choose right.

Part 7
The Recommendation

BGE-M3 (1024-dim) is the right choice for Meridian

Here's why, factor by factor:

  Factor                 BGE-M3 wins?  Why
  Quality                Yes           Higher MTEB scores. Better retrieval means better agent responses, better codex integration, better synthesis.
  Multilingual           Yes           Will speaks French, Spanish, Russian, Italian. Clients may have knowledge in multiple languages. nomic is English-only.
  Scalability            Yes           1024-dim is becoming the industry standard. Future models will likely output 1024+. Starting at 768 means migrating later.
  Production validation  Yes           6,797 nodes, 27,338 edges, 60 evergreen frameworks already proven on BGE-M3. We know it works.
  Speed                  No            nomic is 2.5x faster. But 0.5s vs 0.2s per query is imperceptible to a human. Only matters for bulk ingestion.
  Size                   No            1.3 GB vs 274 MB. Matters on a Raspberry Pi. Doesn't matter on a machine with 64 GB RAM.
  Storage                No            4 KB vs 3 KB per vector. At 10,000 nodes: 40 MB vs 30 MB. Negligible.

The speed and size advantages of nomic are real but irrelevant at Meridian's scale. The quality and multilingual advantages of BGE-M3 are decisive.

The migration path for Rob: Re-embed all 16,717 holons with BGE-M3. On his Mac hardware, this takes ~2-3 hours as a batch job. Run it once. Done. His holons keep all their content — only the vector field changes. Everything else (text, metadata, structure) is untouched.
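Because the text is preserved, the migration reduces to one loop over the holons. A sketch with the model call stubbed out; the function and field names here are assumptions for illustration, not Rob's actual code:

```python
def reembed_all(holons, embed_fn):
    # Regenerate every vector from the permanent text field.
    # All other fields (text, metadata, structure) are untouched.
    for h in holons:
        h["vector"] = embed_fn(h["text"])
    return holons

# Stub standing in for BGE-M3 (the real model returns 1024 floats).
def fake_bge_m3(text):
    return [0.0] * 1024

holons = [
    {"text": "Lead with outcomes...", "vector": [0.0] * 768, "tags": "cold_traffic"},
]
migrated = reembed_all(holons, fake_bge_m3)
print(len(migrated[0]["vector"]))   # 1024
```

In practice this would batch the texts through the model and write the results back to LanceDB, but the shape of the job is exactly this loop.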

What about future models?

New embedding models come out every few months. If a significantly better model appears in 2027, can we switch?

Theoretically yes, practically it's expensive. The cost is re-embedding everything. For the founding three, manageable (a few hours). For 33 nodes? A weekend project. For 100+? A major migration event.

The mitigation: the schema stores the raw text alongside the vector. You always have the original text. Re-embedding means reading every text field and running it through the new model. Nothing is lost — it's just compute time.

This is why storing the full text (not just the embedding) in NODE_SCHEMA is critical. The text is permanent. The embedding is a function of the text + the model. If the model changes, you regenerate. If the text is gone, you're dead.


Part 8
Rob's Current System in Detail

Rob's GHOSTNET uses a different stack at every level:

  Component        Rob (GHOSTNET)                           Q (VOHU MANAH)                       Meridian Base Model
  Embedding model  nomic-embed-text (768-dim)               BGE-M3 (1024-dim)                    BGE-M3 (1024-dim)
  Embedding via    Ollama API                               sentence-transformers (Python)       sentence-transformers (Python)
  Storage          LanceDB (memories.lance)                 LanceDB (knowledge.db/)              LanceDB (knowledge.db/)
  Schema           Semi-structured (varying fields)         NODE_SCHEMA (14 fields, enforced)    NODE_SCHEMA v3 (14+ fields, enforced)
  Collections      4 (memories, dreams, synthesis, errors)  13 (7 node + 4 legacy + 2 custom)    TBD — minimum: knowledge + errors + dreams
  Graph            None (flat holon structure)              SQLite kg_edges (27,338 edges)       SQLite kg_edges
  Interface        AnythingLLM workspace                    Telegram + 4 Dash apps + Copilot     TBD — likely Open WebUI + RAG plugin

What Rob needs to change for Meridian compatibility

  1. Re-embed with BGE-M3 — batch job, ~3 hours. All holon content preserved, only vectors change.
  2. Restructure holons to NODE_SCHEMA — map his fields to the standard 14 fields. Content that doesn't have a mechanism or when_not field gets those fields set to empty string. The schema requires the field to exist, not to be filled.
  3. Add errors.lance and dreams.lance to the standard collection list — these are Rob's contribution to v3. They become standard collections in the base model.
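Step 2 of the list above can be sketched as a field mapping. The holon field names are assumptions about Rob's format, and the vector field is left to the re-embedding pass in step 1; the point is that absent fields become empty strings, satisfying the schema without inventing content:

```python
# NODE_SCHEMA fields as described in Part 2 (vector handled separately
# by the re-embedding pass).
REQUIRED = ["id", "text", "title", "node_type", "source_id", "framework_id",
            "confidence_score", "tags", "mechanism", "situation", "when_not",
            "collection", "date_added"]

def holon_to_node(holon):
    # Copy what exists, default the rest: the schema requires the field
    # to exist, not to be filled.
    node = {f: holon.get(f, "") for f in REQUIRED}
    node["confidence_score"] = holon.get("confidence_score", 0.75)
    node["node_type"] = holon.get("node_type", "principle")
    node["collection"] = holon.get("collection", "knowledge")
    return node

node = holon_to_node({"id": "h-001", "text": "A raw holon insight"})
print(node["mechanism"])   # "" (present but empty)
```

After this pass, every migrated holon answers the same queries as a native NODE_SCHEMA record; empty mechanism fields can be filled in later without another migration.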

His security hardening, dream engine, and swarm architecture are all above the schema layer. They don't need to change. The schema is the data format. The applications built on top of it are independent.


Part 9
The Meridian Schema Architecture (Proposed)
TIER 1: NODE_SCHEMA (the protocol — exchangeable)
  Every principle, tactic, concept, example, error, dream_insight.
  14 fields + v3 additions (gravity_score, validation_count, error_count).
  This is what codex packs contain.
  This is what gets emitted to the collective.
  This is the interoperability guarantee.
  Embedding: BGE-M3, 1024-dim, local CPU.

TIER 2: SYSTEM SCHEMAS (internal — never exchanged)
  CONV_SCHEMA     — conversation records (session history)
  EVERGREEN_SCHEMA — synthesis output pages
  SNAPSHOT_SCHEMA  — system vital signs over time
  AGENT_ACTIVITY   — per-agent activity logs (for dreaming)
  These never leave the sovereign node.
  Each client's system tables are their own business.

TIER 3: GRAPH SCHEMA (relationship layer — exchangeable)
  kg_edges        — connections between NODE_SCHEMA nodes
  Edges ARE part of codex packs (they're the knowledge structure).
  edge_id, from_id, to_id, rel_type, weight, notes, created_at

LEGACY_SCHEMA: RETIRE
  os_context, reference_sites → migrate to NODE_SCHEMA
  or mark as system-only (not codex-compatible)

The one rule: If it participates in codex exchange or collective synthesis, it must be NODE_SCHEMA with BGE-M3 1024-dim embeddings. Everything else is internal plumbing that each build can handle however it wants. The schema is the protocol. The protocol is the product.